Add Parquet variant shredding support #332
Conversation
Pull request overview
Adds end-to-end Parquet shredded-variant support (reader + producer) under Apache.Arrow.Operations.Shredding, with supporting enhancements to Arrow Variant scalar/array APIs and conformance fixtures converted to Arrow IPC for CI.
Changes:
- Introduces `Apache.Arrow.Operations.Shredding` types (e.g., `ShredType`, `ShredOptions`, and shared helpers) to represent and operate on shredded `typed_value` layouts.
- Extends Variant scalar tooling with cross-metadata transcoding support (`VariantValueWriter.CopyValue`) and a metadata prepass helper (`VariantMetadataBuilder.CollectFieldNames`).
- Adds a regeneration script and checks in Arrow IPC fixtures converted from the Parquet shredded-variant corpus.
Reviewed changes
Copilot reviewed 29 out of 166 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/shredded_variant_ipc/regen.py | Script to regenerate Arrow IPC fixtures from the parquet-testing shredded-variant corpus. |
| test/shredded_variant_ipc/case-*.arrow (many files) | Checked-in Arrow IPC fixtures generated from the shredded-variant Parquet test corpus. |
| src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs | Adds CopyValue(VariantReader) to transcode values while re-resolving field IDs against a target metadata dictionary. |
| src/Apache.Arrow.Scalars/Variant/VariantValue.cs | Adds FromDecimal16(SqlDecimal) to preserve Decimal16 intent and support values beyond decimal range. |
| src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs | Adds CollectFieldNames(VariantReader) for two-pass encode workflows. |
| src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs | Adds shared helper to construct per-row ShreddedVariant slots from element-group structs. |
| src/Apache.Arrow.Operations/Shredding/ShredType.cs | Defines the shredding type system for typed_value columns (primitive + object/array). |
| src/Apache.Arrow.Operations/Shredding/ShredOptions.cs | Adds schema inference tuning options (depth, frequency, type consistency). |
| src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj | Adds a project reference to Apache.Arrow to support shredding operations over Arrow arrays/types. |
```csharp
        SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
        return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
    }
    return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
```
FromDecimal16(SqlDecimal) converts to decimal via value.Value when value.Data[3] == 0. SqlDecimal.Value can still throw for values that aren't representable as System.Decimal (e.g., scale/precision beyond decimal’s limits) even when the magnitude fits in 96 bits. Consider storing the SqlDecimal instance in those cases (or using a try/catch fallback) so Decimal16 materialization can’t unexpectedly overflow.
Suggested change:
```diff
-    return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+    try
+    {
+        return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+    }
+    catch (OverflowException)
+    {
+        SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
+        return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
+    }
```
I think in a followup change I'm going to always store a Decimal16 as a SqlDecimal and vice versa. The current "convert to decimal if it fits" strategy is unnecessarily complicated. Filed #33 to cover this.
```csharp
{
    StructType elementGroupType = (StructType)elementGroup.Data.DataType;
    int valueIdx = elementGroupType.GetFieldIndex("value");
    int typedIdx = elementGroupType.GetFieldIndex("typed_value");
```
These should probably be cached; need to take a second look.
(Shouldn't affect the public API, so can be done as a followup.)
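If that follow-up happens, one possible shape is to resolve the indexes once per array rather than once per element. This is a hedged sketch only: the `ElementGroupIndexes` type and where it would be stored are hypothetical, not part of this PR.

```csharp
using Apache.Arrow.Types;

// Hypothetical per-array cache: resolve the struct field indexes once,
// then reuse them for every element instead of calling GetFieldIndex per row.
internal readonly struct ElementGroupIndexes
{
    public readonly int ValueIdx;
    public readonly int TypedIdx;

    public ElementGroupIndexes(StructType elementGroupType)
    {
        ValueIdx = elementGroupType.GetFieldIndex("value");
        TypedIdx = elementGroupType.GetFieldIndex("typed_value");
    }
}
```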
adamreeve left a comment:
I started reviewing this but didn't get very far; I'll just leave the couple of comments I have for now.
adamreeve left a comment:
This all looks good to me, thanks Curt; only a few minor comments.
```csharp
BinaryArray metadataArr = metadataBuilder.Build(allocator);

// value column: residual bytes (or null).
BinaryArray valueArr = BuildBinaryColumn(rows, allocator);
```
Should we omit the value array if values are fully shredded? Probably fine to add that as an optimisation later though if there's a need for it.
My concern with doing that is a hypothetical scenario where we're shredding a column in a very large table. We get the values as an `IArrowArrayStream` instead of an `IArrowArray` and we run each of the batches through `ShredSchemaInferrer`, leaving us with a `ShredSchema`. Now we take a second pass through the `IArrowArrayStream` and shred the batches, one at a time. Each of the batches will need to conform to the shredded schema, and we can't just omit values in one of them without knowing whether or not it can be omitted in all of them.
In short, I think this would require a separate knob based on the bigger picture.
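For concreteness, a rough sketch of that two-pass flow. Only `ShredSchemaInferrer`, `ShredSchema`, `ShredOptions`, and `VariantShredder` are names from this PR; the `Observe`/`Finish`/`Shred` method shapes are placeholders, not the actual API.

```csharp
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Operations.Shredding;

static async Task ShredStreamAsync(
    IArrowArrayStream pass1, IArrowArrayStream pass2, ShredOptions options)
{
    // Pass 1: observe every batch so the inferred schema is valid for
    // the whole stream, not just the first batch.
    var inferrer = new ShredSchemaInferrer(options); // ctor shape assumed
    while (await pass1.ReadNextRecordBatchAsync() is RecordBatch batch)
    {
        inferrer.Observe(batch); // placeholder name
    }
    ShredSchema schema = inferrer.Finish(); // placeholder name

    // Pass 2: every batch must conform to the same shredded schema,
    // which is why `value` can't be dropped from just one of them.
    while (await pass2.ReadNextRecordBatchAsync() is RecordBatch batch)
    {
        RecordBatch shredded = VariantShredder.Shred(batch, schema); // placeholder
        // ... hand `shredded` to the writer
    }
}
```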
What's Changed
Implements the Parquet variant shredding spec end-to-end in a new `Apache.Arrow.Operations.Shredding` namespace, alongside minor changes to the base scalar and array types.

Operations.Shredding reader side:
- `ShreddedVariant`/`ShreddedObject`/`ShreddedArray` ref-struct trio exposing typed columns and residual bytes side-by-side.
- `VariantArrayShreddingExtensions` adds `GetShreddedVariant(i)` and `GetLogicalVariantValue(i)` on `VariantArray` (see the sketch below).
- `ShredSchema.FromArrowType` derives a shredding schema from an Arrow `typed_value` type, rejecting unsupported types (uint32, fixed-size-binary(N≠16)).
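A hedged usage sketch of the reader side. The extension-method names come from this PR; return types, namespaces, and the surrounding setup are assumptions.

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;

static void ReadShredded(VariantArray variants)
{
    for (int i = 0; i < variants.Length; i++)
    {
        // Typed columns plus residual bytes, side by side.
        ShreddedVariant shredded = variants.GetShreddedVariant(i);

        // Or reassemble the logical (unshredded) value for the row.
        VariantValue logical = variants.GetLogicalVariantValue(i);
    }
}
```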
Operations.Shredding producer side:
- `VariantShredder` decomposes a column of `VariantValue`s against a `ShredSchema` into shared metadata + per-row `ShredResult`s.
- `ShreddedVariantArrayBuilder` assembles those into a shredded `VariantArray` with a `typed_value` Arrow tree matching the schema (sketch below).
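And the producer side, sketched end to end. Only the type names and `ShredSchema.FromArrowType` appear in this PR; constructor shapes, method names, and namespaces are assumptions.

```csharp
using System.Collections.Generic;
using Apache.Arrow;
using Apache.Arrow.Memory;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Types;

static VariantArray ShredColumn(
    IArrowType typedValueType, IEnumerable<VariantValue> values, MemoryAllocator allocator)
{
    // Derive the shredding schema from the target typed_value Arrow type.
    ShredSchema schema = ShredSchema.FromArrowType(typedValueType);

    // Decompose each value into shared metadata plus a per-row result,
    // then assemble the results into a shredded VariantArray.
    var shredder = new VariantShredder(schema);            // ctor shape assumed
    var builder = new ShreddedVariantArrayBuilder(schema); // ctor shape assumed
    foreach (VariantValue value in values)
    {
        builder.Append(shredder.Shred(value)); // method names assumed
    }
    return builder.Build(allocator);
}
```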
Apache.Arrow changes:
- `VariantExtensionDefinition` accepts `struct<metadata, value?, typed_value?>` layouts in addition to the plain unshredded form.
- `VariantType` gains `IsShredded`/`HasValueColumn`/`HasTypedValueColumn`/`TypedValueField` properties.
- `VariantArray.GetVariantValue` and `GetVariantReader` throw on shredded columns with a pointer to the `Operations.Shredding` extensions (see the guard sketch below).
- The public `VariantArray(IArrowArray)` constructor now infers the `VariantType` (shredded or not) from the storage shape.
- Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.
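A short hedged sketch of how the new properties compose with the throwing accessors. The property names are from this PR; everything else here is assumed.

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;

static VariantValue GetRow(VariantArray variants, int i)
{
    var variantType = (VariantType)variants.Data.DataType;

    // GetVariantValue throws on shredded columns, so route those rows
    // through the Operations.Shredding extension instead.
    return variantType.IsShredded
        ? variants.GetLogicalVariantValue(i)
        : variants.GetVariantValue(i);
}
```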
Apache.Arrow.Scalars changes:
- `VariantValueWriter.CopyValue(VariantReader source)` transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.
- `VariantMetadataBuilder.CollectFieldNames(VariantReader source)` is the two-pass companion that accumulates source field names into the target metadata builder (two-pass flow sketched below).
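To make the two-pass shape concrete, a hedged sketch of a single cross-dictionary transcode. `CreateReader` is a hypothetical stand-in, and the writer constructor shape is an assumption; only `CollectFieldNames` and `CopyValue` come from this PR.

```csharp
static void Transcode(byte[] sourceBytes)
{
    // Pass 1: the metadata prepass. Register every field name the source
    // uses in the target metadata builder.
    var metadataBuilder = new VariantMetadataBuilder();
    metadataBuilder.CollectFieldNames(CreateReader(sourceBytes)); // CreateReader is hypothetical

    // Pass 2: transcode. CopyValue re-resolves each field ID against the
    // dictionary built in pass 1, so a value encoded under a different
    // metadata dictionary lands with correct IDs.
    var writer = new VariantValueWriter(metadataBuilder); // ctor shape assumed
    writer.CopyValue(CreateReader(sourceBytes));
}
```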
Validation:
- Conformance fixtures come from `apache/parquet-testing` (`test/parquet-testing/shredded_variant/`).
- `test/shredded_variant_ipc/regen.py` converts each `case-NNN.parquet` to an Arrow IPC file via `pyarrow`; 137 resulting `.arrow` files are checked in so CI needs no Python.
- All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.